Filled Pause
Research Center

Investigating 'um' and 'uh' and other hesitation phenomena

April 30th, 2021

Is there a substantive difference between real-time fluency detection and latent fluency detection?

I have been promoting what makes the Fluidity application unique: its ability to detect fluency features in real time, maintain a constantly updating set of measurements, and adapt the operation of the application accordingly. This contrasts with most applications, in which fluency measurement takes place on a completed speech sample; that is, after the speaker has finished. Beyond the technical differences between these two approaches, is there any practical difference for speakers or for listeners (or computers)?

In terms of ecological validity, both real-time and latent judgments of fluency occur in everyday listening. When we listen to a speaker for the first time, we might make a judgment within the first few seconds as to their speech fluency level. Naturally, as time goes on and we observe more of their speech, that judgment may be updated; hence, real-time measures. But there are times when we listen to someone speak without making any judgment about their fluency because it is unnecessary to the task at hand; for instance, when we are listening only as a non-participating third party and only some time later need to judge their fluency level. Of course, depending on how latent that judgment is, our estimate might not be very accurate (we may not have been listening carefully enough to have any memory of the speaker's temporal or hesitation patterns).

Photo of Antony Gormley "Untitled (Listening)" by Amanda Slater. From https://commons.wikimedia.org/wiki/File:Antony_Gormley_%22Untitled_(Listening)%22_(5539005578).jpg

Qualitatively, the difference between real-time and latent judgments is smaller than it might seem. Real-time judgments are effectively cumulative judgments. That is, a listener is not judging the speaker's speech at that very moment and only at that very moment. Rather, they are updating their overall judgment of the speech so far based on any new evidence present in the current moment; hence, a cumulative judgment. A latent judgment is also, effectively, a cumulative judgment, just one made after the end-point has been reached. So, in this sense, they are not so different. The two judgments might differ considerably at the outset, as when a speaker is very slow to get started on their speech. But as time goes on, the listener's real-time judgment should approach their latent judgment. (Though it may also be interesting to consider whether this is not the case: perhaps making such judgments in real time results in a final state that is different from a latent judgment.)
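To make the idea of a cumulative judgment concrete, here is a minimal sketch in Python. It is not how Fluidity works; the filled-pause inventory, the `RunningFluency` class, the `latent_rate` function, and the sample utterance are all hypothetical, and the "judgment" is reduced to a single toy measure (the proportion of tokens that are filled pauses). The running estimate is updated token by token, while the latent estimate is computed once over the finished sample.

```python
from dataclasses import dataclass

# Assumption: a toy inventory of filled pauses, for illustration only.
FILLED_PAUSES = {"um", "uh"}


@dataclass
class RunningFluency:
    """Cumulative filled-pause rate, updated one token at a time."""
    tokens: int = 0
    filled_pauses: int = 0

    def update(self, token: str) -> float:
        """Add one token and return the rate over all speech so far."""
        self.tokens += 1
        if token.lower() in FILLED_PAUSES:
            self.filled_pauses += 1
        return self.rate()

    def rate(self) -> float:
        return self.filled_pauses / self.tokens if self.tokens else 0.0


def latent_rate(tokens: list[str]) -> float:
    """Filled-pause rate computed once, over the completed sample."""
    if not tokens:
        return 0.0
    return sum(t.lower() in FILLED_PAUSES for t in tokens) / len(tokens)


# Made-up sample: a speaker who is slow to get started, then settles down.
sample = "um uh so um the results were clear and the method worked well".split()

tracker = RunningFluency()
trajectory = [tracker.update(tok) for tok in sample]

for tok, value in zip(sample, trajectory):
    print(f"{tok:>8}  running rate = {value:.2f}")
print(f"latent rate = {latent_rate(sample):.2f}")
```

In this toy case the running estimate starts out far from the latent one (the first tokens are all hesitations) and ends up identical to it, simply because both are computed from the same counts. The open question in the parenthesis above is precisely whether a human listener's real-time judgment behaves like this, or whether early impressions weigh more heavily than a plain cumulative count would suggest.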

But what value does real-time information have? If some feedback is given to a speaker in real time about their speech performance so far, can they actually make adjustments that affect their overall outcome for the better? (The "for the better" part is quite important, since it would be counter-productive to make speakers overly self-conscious if that actually makes their speech sound less fluent.) I think it is clear that speakers are constantly changing their speech strategy throughout a speech, even if they are not speaking to someone directly. If it is truly spontaneous speech, then they will be judging whether the speech they have given has the intended effect on the listener and adapting their speech accordingly: speeding up if the audience seems impatient, repeating things if the listener looks confused, and so on. So, meaningful feedback could be actionable for the speaker. (Of course, what kind of feedback is actionable is yet another question to consider.)

Will such feedback be actionable at any time throughout their speech? Perhaps at the beginning of a long speaking turn, a speaker is not so cognizant of the listener's manner. This might be because, at the start, they carry a large cognitive burden in deciding how to begin saying what they wish to say. That is, they need to decide their speech strategy. But once that is decided and speech has started, they may have more cognitive resources available to notice the listener's manner.

In any case, though, I think there is arguably a benefit to real-time feedback that is determined by the speech performance so far. But are listeners capable of providing that feedback? At the start of listening, they, too, might be trying extra hard to listen carefully and anticipate what the speaker is about to say. Thus, their feedback may not be quite as meaningful. That is, they may simply show a generic or neutral listening expression, and may not be able to update it until they have more relevant information about the speaker's performance. But as time progresses, I conjecture they will begin to show feedback that is potentially useful to speakers. They may not provide this feedback consciously, but at best subconsciously, and perhaps even involuntarily.